Latent Dirichlet Allocation

Motivation

Specification

original version:

For each document d, draw topic proportions θ_d ~ Dirichlet_K(α); then, for each word position n in that document, draw a topic z_{d,n} ~ Categorical_K(θ_d) and a word w_{d,n} ~ Categorical_V(β_{z_{d,n}}), where β is a K × V topic-word matrix treated as a fixed model parameter.

smoothed version:

Same as above, except that the topic-word distributions are themselves random: draw ϕ_k ~ Dirichlet_V(β) for each topic k, and generate each word as w_{d,n} ~ Categorical_V(ϕ_{z_{d,n}}).

Parameters for original version:

α : Dirichlet hyperprior on the per-document topic proportions
β : a K × V matrix of topic-word probabilities, β_{ij} = p(w^j = 1 | z^i = 1), estimated as a fixed parameter
θ_d (d = 1, ..., M) ~ Dirichlet_K(α)
z_{d,n} (d = 1, ..., M; n = 1, ..., N_d) ~ Categorical_K(θ_d)
w_{d,n} (d = 1, ..., M; n = 1, ..., N_d) ~ Categorical_V(β_{z_{d,n}})

Parameters for smoothed version:

α : Dirichlet hyperprior on the per-document topic proportions, either a constant or a random variable
β : Dirichlet hyperprior on the per-topic word distributions, either a constant or a random variable
θ_d (d = 1, ..., M) ~ Dirichlet_K(α)
ϕ_k (k = 1, ..., K) ~ Dirichlet_V(β)
z_{d,n} (d = 1, ..., M; n = 1, ..., N_d) ~ Categorical_K(θ_d)
w_{d,n} (d = 1, ..., M; n = 1, ..., N_d) ~ Categorical_V(ϕ_{z_{d,n}})
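To make the smoothed generative process concrete, here is a small simulation sketch in Python with numpy. The corpus sizes, the symmetric hyperparameter values, and the Poisson document lengths are illustrative assumptions, not part of the model specification.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): M documents, K topics, V vocabulary words.
M, K, V = 100, 10, 1000
alpha, beta = 0.1, 0.01               # symmetric Dirichlet hyperpriors
N = rng.poisson(80, size=M)           # document lengths N_d (Poisson only for this sketch)

phi = rng.dirichlet(np.full(V, beta), size=K)      # phi_k ~ Dirichlet_V(beta): topic-word distributions
theta = rng.dirichlet(np.full(K, alpha), size=M)   # theta_d ~ Dirichlet_K(alpha): doc-topic proportions

docs = []
for d in range(M):
    z = rng.choice(K, size=N[d], p=theta[d])              # z_dn ~ Categorical_K(theta_d)
    w = np.array([rng.choice(V, p=phi[k]) for k in z])    # w_dn ~ Categorical_V(phi_{z_dn})
    docs.append(w)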

In the original version, the joint distribution of a topic mixture θ, topic assignments z, and words w for a single document is:

p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

The marginal distribution of a document:
p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\, d\theta

Posterior distribution of hidden variables:
p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}

This posterior is intractable to compute, due to the coupling between θ and β.
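
To see this coupling concretely, the marginal can be written out in terms of the Dirichlet density and the word indicators w_n^j (this follows the standard expansion from the original LDA paper, with β_{ij} the probability of word j under topic i):

p(w \mid \alpha, \beta) = \frac{\Gamma\big(\sum_{i}\alpha_i\big)}{\prod_{i}\Gamma(\alpha_i)} \int \Big(\prod_{i=1}^{K}\theta_i^{\alpha_i - 1}\Big) \Big(\prod_{n=1}^{N}\sum_{i=1}^{K}\prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j}\Big)\, d\theta

The product θ_i β_{ij} inside the sum over topics ties θ to β, so the integral has no closed form.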

Variational distribution

Here we introduce a variational distribution over the latent variables, which factorizes as:

q(\theta, z \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)

and optimize:
(\gamma^*, \phi^*) = \arg\min_{\gamma, \phi} D_{KL}\big(q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid w, \alpha, \beta)\big)

to make the variational distribution as close as possible to the true posterior.
Rearranging the KL divergence shows that:
D_{KL}(q \,\|\, p) + L(\gamma, \phi; \alpha, \beta) = \log p(w \mid \alpha, \beta) = \text{const. w.r.t. } \gamma, \phi

where
L(\gamma, \phi; \alpha, \beta) = E_q[\log p(\theta, z, w \mid \alpha, \beta)] - E_q[\log q(\theta, z)]
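
To see why the identity holds, expand the KL divergence using the definition of L; this is a short derivation sketch, where the last step uses the fact that log p(w | α, β) does not depend on θ or z:

D_{KL}(q \,\|\, p)
= E_q[\log q(\theta, z)] - E_q[\log p(\theta, z \mid w, \alpha, \beta)]
= E_q[\log q(\theta, z)] - E_q[\log p(\theta, z, w \mid \alpha, \beta)] + \log p(w \mid \alpha, \beta)
= -L(\gamma, \phi; \alpha, \beta) + \log p(w \mid \alpha, \beta)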

So minimizing the KL divergence is equivalent to maximizing L, which is a lower bound on \log p(w \mid \alpha, \beta) (the evidence lower bound):
(\gamma^*, \phi^*) = \arg\max_{\gamma, \phi} L(\gamma, \phi; \alpha, \beta)

Parameter estimation

Variational EM algorithm:

E-step: variational inference

Expanding L further:

L(\gamma, \phi; \alpha, \beta) = E_q[\log p(\theta, z, w \mid \alpha, \beta)] - E_q[\log q(\theta, z)]
= E_q[\log p(\theta \mid \alpha)] + E_q[\log p(z \mid \theta)] + E_q[\log p(w \mid z, \beta)] - E_q[\log q(\theta)] - E_q[\log q(z)]

Working through each term in closed form (ψ is the digamma function; the key identity is the Dirichlet expectation E_q[log θ_i] = ψ(γ_i) - ψ(Σ_{j=1}^K γ_j)) gives, for a single document:

L(\gamma, \phi; \alpha, \beta)
= \log\Gamma\Big(\sum_{j=1}^{K}\alpha_j\Big) - \sum_{i=1}^{K}\log\Gamma(\alpha_i) + \sum_{i=1}^{K}(\alpha_i - 1)\Big(\psi(\gamma_i) - \psi\Big(\sum_{j=1}^{K}\gamma_j\Big)\Big)
+ \sum_{n=1}^{N}\sum_{i=1}^{K}\phi_{ni}\Big(\psi(\gamma_i) - \psi\Big(\sum_{j=1}^{K}\gamma_j\Big)\Big)
+ \sum_{n=1}^{N}\sum_{i=1}^{K}\sum_{j=1}^{V}\phi_{ni}\, w_n^j \log\beta_{ij}
- \log\Gamma\Big(\sum_{j=1}^{K}\gamma_j\Big) + \sum_{i=1}^{K}\log\Gamma(\gamma_i) - \sum_{i=1}^{K}(\gamma_i - 1)\Big(\psi(\gamma_i) - \psi\Big(\sum_{j=1}^{K}\gamma_j\Big)\Big)
- \sum_{n=1}^{N}\sum_{i=1}^{K}\phi_{ni}\log\phi_{ni}
Taking derivatives of this function and setting them to zero yields the update formulas.
The variational inference algorithm updates γ and ϕ alternately until convergence:

\phi_{ni} \propto \beta_{i, w_n} \exp\Big(\psi(\gamma_i) - \psi\Big(\sum_{j=1}^{K}\gamma_j\Big)\Big)
\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}

where β_{i, w_n} is the probability of word w_n under topic i, and ϕ_n is normalized so that Σ_i ϕ_{ni} = 1.
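A minimal Python sketch of this per-document E-step, assuming beta is a (K, V) matrix of topic-word probabilities, alpha a length-K vector, and word_ids the vocabulary indices of the document's tokens; these names, the initialization, and the convergence settings are illustrative rather than from the text.

import numpy as np
from scipy.special import digamma

def e_step_document(word_ids, beta, alpha, n_iter=50, tol=1e-6):
    """Coordinate ascent on (gamma, phi) for one document.

    word_ids : array of vocabulary indices for the N_d tokens
    beta     : (K, V) topic-word probabilities
    alpha    : (K,)  Dirichlet hyperparameter
    """
    K = beta.shape[0]
    N = len(word_ids)
    phi = np.full((N, K), 1.0 / K)        # phi_{ni}: per-token topic responsibilities
    gamma = alpha + N / K                 # common initialization: alpha_i + N/K
    for _ in range(n_iter):
        # phi_{ni} proportional to beta_{i, w_n} * exp(psi(gamma_i) - psi(sum_j gamma_j))
        log_phi = np.log(beta[:, word_ids].T + 1e-100) + digamma(gamma) - digamma(gamma.sum())
        phi_new = np.exp(log_phi - log_phi.max(axis=1, keepdims=True))
        phi_new /= phi_new.sum(axis=1, keepdims=True)
        # gamma_i = alpha_i + sum_n phi_{ni}
        gamma_new = alpha + phi_new.sum(axis=0)
        converged = np.abs(gamma_new - gamma).sum() < tol
        phi, gamma = phi_new, gamma_new
        if converged:
            break
    return gamma, phi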

M-step

Maximize L(γ,ϕ;α,β) with respect to β:

L_\beta = \sum_{d=1}^{M} \sum_{n=1}^{N_d} \sum_{i=1}^{K} \sum_{j=1}^{V} \phi_{dni}\, w_{dn}^j \log\beta_{ij} + \sum_{i=1}^{K} \lambda_i \Big(\sum_{j=1}^{V} \beta_{ij} - 1\Big)

Taking the derivative with respect to βij and setting it to zero:
\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^j

Maximize L(γ,ϕ;α,β) with respect to α:

L_\alpha = \sum_{d=1}^{M} \Big[ \log\Gamma\Big(\sum_{j=1}^{K}\alpha_j\Big) - \sum_{i=1}^{K}\log\Gamma(\alpha_i) + \sum_{i=1}^{K}(\alpha_i - 1)\Big(\psi(\gamma_{di}) - \psi\Big(\sum_{j=1}^{K}\gamma_{dj}\Big)\Big) \Big]

Taking the derivative with respect to αi:
\frac{\partial L}{\partial \alpha_i} = M\Big(\psi\Big(\sum_{j=1}^{K}\alpha_j\Big) - \psi(\alpha_i)\Big) + \sum_{d=1}^{M}\Big(\psi(\gamma_{di}) - \psi\Big(\sum_{j=1}^{K}\gamma_{dj}\Big)\Big)

It is difficult to solve for α_i by setting this derivative to zero, so we also compute the Hessian matrix (δ(i,j) = 1 if i = j and 0 otherwise; ψ' denotes the trigamma function):
\frac{\partial^2 L}{\partial \alpha_i\, \partial \alpha_j} = M\Big(\psi'\Big(\sum_{j=1}^{K}\alpha_j\Big) - \delta(i,j)\,\psi'(\alpha_i)\Big)

and feed this Hessian matrix and the gradient into Newton's method to obtain α.
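
A companion sketch of the M-step in Python. It assumes the E-step has accumulated the sufficient statistic ss_beta[i, j] = Σ_d Σ_n ϕ_dni w_dn^j and stacked the per-document γ_d vectors into a matrix; the function names and the crude positivity safeguard are my own additions, not part of the derivation.

import numpy as np
from scipy.special import digamma, polygamma

def m_step_beta(ss_beta):
    """ss_beta[i, j] = sum_d sum_n phi_{dni} * w_{dn}^j, accumulated during the E-step."""
    return ss_beta / ss_beta.sum(axis=1, keepdims=True)   # beta_{ij} proportional to counts, normalized per topic

def m_step_alpha(alpha, gammas, n_iter=100, tol=1e-6):
    """Newton's method for alpha using the gradient and Hessian given above.

    gammas : (M, K) variational Dirichlet parameters, one row per document
    """
    M, K = gammas.shape
    # sum_d (psi(gamma_di) - psi(sum_j gamma_dj)), fixed during the alpha update
    ss = (digamma(gammas) - digamma(gammas.sum(axis=1, keepdims=True))).sum(axis=0)
    for _ in range(n_iter):
        grad = M * (digamma(alpha.sum()) - digamma(alpha)) + ss
        hess = M * (polygamma(1, alpha.sum()) - np.diag(polygamma(1, alpha)))
        step = np.linalg.solve(hess, grad)
        alpha_new = alpha - step                      # Newton update for a maximum of L
        if np.any(alpha_new <= 0):                    # crude safeguard to keep alpha positive
            alpha_new = np.clip(alpha - 0.5 * step, 1e-6, None)
        if np.abs(alpha_new - alpha).max() < tol:
            return alpha_new
        alpha = alpha_new
    return alpha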

Gibbs Sampling

(for the smoothed version)
Theoretical analysis: integrating out θ and ϕ (collapsing them) gives the conditional distribution of a single topic assignment,

p(z_{d,n} = k \mid z_{-(d,n)}, w, \alpha, \beta) \propto \big(n_{d,k}^{-(d,n)} + \alpha_k\big) \cdot \frac{n_{k,\, w_{d,n}}^{-(d,n)} + \beta_{w_{d,n}}}{\sum_{v=1}^{V} \big(n_{k,v}^{-(d,n)} + \beta_v\big)}

where n_{d,k} counts the tokens in document d assigned to topic k, n_{k,v} counts how many times word v is assigned to topic k, and the superscript -(d,n) excludes the current token. Both factors take this simple closed form due to the conjugate prior (Dirichlet-multinomial). Note that the normalizer of the first term is omitted, because its sum is the length of the document, which is fixed, while the second denominator changes after each update.

Algorithm:

1. Randomly initialize the topic assignment z_{d,n} of every token and build the count tables n_{d,k} and n_{k,v}.
2. For each sweep and each token (d, n): remove the token from the counts, sample a new topic from the conditional above, and add the token back under the new topic.
3. After burn-in, estimate ϕ and θ from the counts (or average over several samples).

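A compact Python sketch of this collapsed Gibbs sampler with symmetric scalar hyperparameters α and β; the data layout (docs as a list of integer token-id arrays), the sweep count, and all variable names are illustrative assumptions.

import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, n_sweeps=200, seed=0):
    """docs: list of 1-D integer arrays, docs[d][n] = vocabulary id of token n in document d."""
    rng = np.random.default_rng(seed)
    M = len(docs)
    n_dk = np.zeros((M, K))           # tokens in document d assigned to topic k
    n_kv = np.zeros((K, V))           # tokens of word v assigned to topic k
    n_k = np.zeros(K)                 # total tokens assigned to topic k
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # random initial assignments
    for d, doc in enumerate(docs):
        for n, v in enumerate(doc):
            k = z[d][n]
            n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1
    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for n, v in enumerate(doc):
                k = z[d][n]
                # remove the current token from the counts
                n_dk[d, k] -= 1; n_kv[k, v] -= 1; n_k[k] -= 1
                # p(z_dn = k | ...) proportional to (n_dk + alpha) * (n_kv + beta) / (n_k + V * beta)
                p = (n_dk[d] + alpha) * (n_kv[:, v] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k
                n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1
    # point estimates of the topic-word and document-topic distributions
    phi = (n_kv + beta) / (n_kv.sum(axis=1, keepdims=True) + V * beta)
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)
    return phi, theta, z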
For comparisons and discussion of MCMC versus Variational Bayes, see Variational Bayes.

Extensions

Relaxing the assumptions:

the bag-of-words assumption, the assumed exchangeability of documents (dynamic topic models), the fixed number of topics (Bayesian nonparametric topic models)...

Incorporating metadata:

authors, links, other labels (supervised)...

Other problems:

model checking, visualization, data discovery...

References

Blei, David M. "Probabilistic topic models." Communications of the ACM 55.4 (2012): 77-84.
"Machine Learning" Lecture 19: http://www.umiacs.umd.edu/~jbg/teaching/CSCI_5622/
"Probabilistic Models for Unsupervised Learning" Lecture 5: http://home.cse.ust.hk/~lzhang/teach/6931a/
Dirichlet-Multinomial Distribution: https://en.wikipedia.org/wiki/Dirichlet-multinomial_distribution#A_combined_example:_LDA_topic_models
Darling, William M. "A theoretical and practical implementation tutorial on topic modeling and Gibbs sampling." Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. 2011.